294 research outputs found

A complexity analysis of statistical learning algorithms

Author: Kon Mark A.
Publication date: 18/12/2012

We apply information-based complexity analysis to support vector machine (SVM) algorithms, with the goal of a comprehensive continuous algorithmic analysis of such algorithms. This involves complexity measures in which some higher order operations (e.g., certain optimizations) are considered primitive for the purposes of measuring complexity. We consider classes of information operators and algorithms made up of scaled families, and investigate the utility of scaling the complexities to minimize error. We look at the division of statistical learning into information and algorithmic components, at the complexities of each, and at applications to SVM and more general machine learning algorithms. We give applications to SVM algorithms graded into linear and higher order components, and give an example in biomedical informatics.

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

On the probabilistic continuous complexity conjecture

Author: Kon Mark A.
Publication date: 01/01/2012

In this paper we prove the probabilistic continuous complexity conjecture. In continuous complexity theory, this states that the complexity of solving a continuous problem with probability approaching 1 converges (in this limit) to the complexity of solving the same problem in its worst case. We prove the conjecture holds if and only if the space of problem elements is uniformly convex. The non-uniformly convex case has a striking counterexample in the problem of identifying a Brownian path in Wiener space, where it is shown that probabilistic complexity converges to only half of the worst case complexity in this limit.

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

The Marr Conjecture and Uniqueness of Wavelet Transforms

Author: Allen Ben
Kon Mark
Publication date: 15/04/2015

The inverse question of identifying a function from the nodes (zeroes) of its wavelet transform arises in a number of fields. These include whether the nodes of a heat or hypoelliptic equation solution determine its initial conditions, and in mathematical vision theory the Marr conjecture, on whether an image is mathematically determined by its edge information. We prove a general version of this conjecture by reducing it to the moment problem, using a basis dual to the Taylor monomial basis $x^\alpha$ on $\mathbb{R}^n$. Comment: 52 pages, 4 figures

arXiv.org e-Print Archive

CiteSeerX

On Some Integrated Approaches to Inference

Author: Kon Mark A.
Plaskota Leszek
Publication date: 01/01/2012

We present arguments for the formulation of a unified approach to different standard continuous inference methods from partial information. It is claimed that an explicit partition of information into a priori (prior knowledge) and a posteriori information (data) is an important way of standardizing inference approaches so that they can be compared on a normative scale, and so that notions of optimal algorithms become farther-reaching. The inference methods considered include neural network approaches, information-based complexity, and Monte Carlo, spline, and regularization methods. The model is an extension of currently used continuous complexity models, with a class of algorithms in the form of optimization methods, in which an optimization functional (involving the data) is minimized. This extends the family of current approaches in continuous complexity theory, which include the use of interpolatory algorithms in worst and average case settings.

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

Relationships among Interpolation Bases of Wavelet Spaces and Approximation Spaces

Author: Kon Mark A.
Zhang Zhiguo
Publication date: 21/12/2012

A multiresolution analysis is a nested chain of related approximation spaces. This nesting in turn implies relationships among interpolation bases in the approximation spaces and their derived wavelet spaces. Using these relationships, a necessary and sufficient condition is given for the existence of interpolation wavelets, via analysis of the corresponding scaling functions. It is also shown that any interpolation function for an approximation space plays the role of a special type of scaling function (an interpolation scaling function) when the corresponding family of approximation spaces forms a multiresolution analysis. Based on these interpolation scaling functions, a new algorithm is proposed for constructing corresponding interpolation wavelets (when they exist in a multiresolution analysis). In simulations, our theorems on the existence and construction of interpolation wavelets in a general multiresolution analysis are tested for several typical wavelet spaces.

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

Transcription Factor-DNA Binding Via Machine Learning Ensembles

Author: DeLisi Charles
Fan Yue
Kon Mark
Publication date: 09/05/2018

We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWMs) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over (a) the motif scanning method and (b) the coexpression-based association method. Top motif outperformed 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (Problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information. Comment: 33 pages

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

BowSaw: inferring higher-order trait interactions associated with complex biological phenotypes

Author: Dimucci Demetrius
Kon Mark
Segre Daniel
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 12/02/2020

Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g. from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue towards new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset, and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses. Accepted manuscript

Boston University Institutional Repository (OpenBU)

PubMed Central